JIIT Placement Alerts

Email Processing & Classification

Table of Contents

  1. Introduction

  2. Project Structure

  3. Core Components

  4. Architecture Overview

  5. Detailed Component Analysis

  6. Dependency Analysis

  7. Performance Considerations

  8. Troubleshooting Guide

  9. Conclusion

  10. Appendices

Introduction

This document explains the email processing and classification system that powers the notification pipeline for general notices and updates. It covers:

  • Notice classification algorithms using LLM-based prompts

  • Email parsing pipeline (headers, bodies, metadata)

  • Notice types, priority, filtering, and content enrichment

  • Integration with email clients, authentication, and batch processing

  • Examples of processed data structures, accuracy considerations, and handling of malformed or suspicious emails

Project Structure

The email processing system centers around a LangGraph pipeline that classifies incoming emails and extracts structured notices. It integrates with:

  • Google Groups client for fetching unread emails

  • LLM prompts for classification and extraction

  • Database persistence for notices and policy documents

  • Notification dispatch for Telegram and web push

graph TB
  subgraph "Email Sources"
    GG["GoogleGroupsClient<br/>IMAP-based email fetch"]
  end
  subgraph "Processing"
    ENS["EmailNoticeService<br/>LangGraph pipeline"]
    NFS["NoticeFormatterService<br/>Classification & formatting"]
    PPS["PlacementPolicyService<br/>Policy extraction"]
  end
  subgraph "Persistence"
    DB["DatabaseService<br/>MongoDB Notices/Jobs/Policies"]
    DBC["DBClient<br/>MongoDB connection"]
  end
  subgraph "Delivery"
    NS["NotificationService<br/>Dispatch to channels"]
    TG["TelegramService"]
    WP["WebPushService"]
  end
  GG --> ENS
  ENS --> NFS
  ENS --> PPS
  NFS --> DB
  PPS --> DB
  DB --> NS
  NS --> TG
  NS --> WP

Core Components

  • EmailNoticeService: Orchestrates LLM-based classification and extraction for general notices; handles placement policy detection and fallbacks.

  • NoticeFormatterService: Provides classification, enrichment, and formatting for notices and job postings.

  • GoogleGroupsClient: Decodes email headers, extracts forwarded metadata, and normalizes dates to IST.

  • PlacementPolicyService: Specialized extraction for placement policy updates into structured Markdown with TOC generation.

  • DatabaseService: Persists notices, jobs, policies, and manages deduplication and stats.

  • NotificationService: Routes formatted notices to Telegram and web push channels.

  • Webhook server: Exposes health, stats, and notification endpoints for external integrations.

Architecture Overview

The email processing pipeline follows a LangGraph workflow:

  • Input: Unread email IDs fetched from Google Groups

  • Nodes: classify → extract_notice → validate → display_results

  • Decision edges: classify → extract_notice if relevant; retry extraction on validation errors; skip if not a valid notice

  • Outputs: Persisted NoticeDocument or policy updates

sequenceDiagram
  participant Cron as "Scheduler/CLI"
  participant GGC as "GoogleGroupsClient"
  participant ENS as "EmailNoticeService"
  participant LLM as "ChatGoogleGenerativeAI"
  participant DB as "DatabaseService"
  Cron->>GGC : get_unread_message_ids()
  loop For each unread email
    Cron->>GGC : fetch_email(email_id)
    GGC-->>Cron : {subject, sender, body, time_sent}
    Cron->>ENS : process_single_email(email_data)
    ENS->>ENS : classify (always relevant)
    ENS->>LLM : NOTICE_EXTRACTION_PROMPT
    LLM-->>ENS : JSON notice or policy update
    alt Valid notice
      ENS->>ENS : validate fields
      ENS->>DB : save_notice()
      DB-->>ENS : success
    else Policy update
      ENS->>ENS : detect is_policy_update
      ENS->>PPS : process_policy_email()
      PPS-->>ENS : PolicyDocument
    end
  end
  Cron->>GGC : mark_as_read(email_id) (on success)

Detailed Component Analysis

EmailNoticeService: Classification and Extraction

  • Classification: The classify node marks every email as relevant and defers the actual relevance decision to the LLM extraction step, which rejects irrelevant, spam, and placement-offer emails.

  • Extraction: Uses a structured prompt to return JSON with fields for title, content, type, source, deadlines, links, and type-specific fields.

  • Validation: Ensures minimum length for title/content and presence of type.

  • Retry logic: Retries extraction up to two times on validation errors.

  • Policy detection: Detects placement policy updates and triggers a secondary extraction with a specialized prompt.

flowchart TD
  Start(["Start"]) --> Classify["Classify: always relevant"]
  Classify --> Decide{"Relevant?"}
  Decide --> |No| Display["Display rejection reason"]
  Decide --> |Yes| Extract["LLM extraction with structured prompt"]
  Extract --> Validate{"Validate notice fields"}
  Validate --> |Errors| Retry{"Retry < 2?"}
  Retry --> |Yes| Extract
  Retry --> |No| Reject["Reject with validation errors"]
  Validate --> |OK| Save["Create NoticeDocument"]
  Save --> DB["Persist via DatabaseService"]
  DB --> MarkRead["Mark email as read"]
  Display --> End(["End"])
  Reject --> End
  MarkRead --> End
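
The validate-and-retry loop described above can be sketched in plain Python. This is a minimal illustration, not the service's actual code: the length thresholds, the `validate_notice` and `extract_with_retry` names, and the idea of passing previous errors back to the extractor are all assumptions. Only the "up to two retries" limit and the checked fields (title, content, type) come from the text.

```python
from typing import Callable, Optional

MIN_TITLE_LEN = 5      # assumed threshold, not taken from the project
MIN_CONTENT_LEN = 20   # assumed threshold, not taken from the project
MAX_RETRIES = 2        # "up to two retries" as stated in the text


def validate_notice(notice: dict) -> list[str]:
    """Return validation errors; an empty list means the notice passes."""
    errors = []
    if len((notice.get("title") or "").strip()) < MIN_TITLE_LEN:
        errors.append("title too short")
    if len((notice.get("content") or "").strip()) < MIN_CONTENT_LEN:
        errors.append("content too short")
    if not notice.get("type"):
        errors.append("type missing")
    return errors


def extract_with_retry(
    email_data: dict,
    extract: Callable[[dict, list[str]], dict],
) -> tuple[Optional[dict], list[str]]:
    """Call `extract` once, then retry up to MAX_RETRIES times on validation errors."""
    errors: list[str] = []
    for _attempt in range(1 + MAX_RETRIES):
        # Errors from the previous attempt are handed to the extractor,
        # which could include them in the next prompt.
        notice = extract(email_data, errors)
        errors = validate_notice(notice)
        if not errors:
            return notice, []
    return None, errors
```

Injecting `extract` as a callable keeps the loop testable without an LLM: a stub that fails twice and then succeeds exercises the whole retry path.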

NoticeFormatterService: Classification, Matching, and Formatting

  • Classification: Single-label classifier for categories: update, shortlisting, announcement, hackathon, webinar, job posting.

  • Matching: Extracts company names and fuzzy-matches against job listings for enrichment.

  • Formatting: Produces Telegram-ready messages with consistent structure, IST date formatting, and optional job enrichment.

flowchart TD
  A["Raw text from notice"] --> B["Classify category"]
  B --> C["Match company to jobs (fuzzy)"]
  C --> D["Extract structured info (JSON)"]
  D --> E["Format message (Telegram/markdown)"]
  E --> F["Attach attribution/footer"]
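
The fuzzy company matching step might look like the following sketch. The source does not say which matching technique the service uses, so this substitutes the standard library's `difflib`; the `normalize` and `match_company` helpers, the suffix list, and the 0.8 cutoff are hypothetical.

```python
import difflib
from typing import Optional

# Corporate suffixes to ignore when comparing names (an assumed list).
SUFFIXES = {"pvt", "ltd", "limited", "inc", "private"}


def normalize(name: str) -> str:
    """Lowercase and drop punctuation and common suffixes so variants compare equal."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(word for word in cleaned.split() if word not in SUFFIXES)


def match_company(
    candidate: str, job_companies: list[str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the job-listing company closest to `candidate`, or None below the cutoff."""
    by_normalized = {normalize(company): company for company in job_companies}
    hits = difflib.get_close_matches(
        normalize(candidate), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

A high cutoff favors missed matches over wrong ones, which seems the safer failure mode when the match decides which job details get attached to a notice.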

GoogleGroupsClient: Email Parsing and Metadata Extraction

  • Fetches unread emails and parses multipart messages.

  • Extracts forwarded sender and forwarded date, normalizing to IST.

  • Provides safe marking/unmarking of emails as read/unread.

classDiagram
  class GoogleGroupsClient {
    +connect() imaplib
    +disconnect() void
    +get_unread_message_ids(folder) str[]
    +fetch_email(email_id, folder, mark_as_read) Dict~str,str~
    +fetch_unread_emails(folder, mark_as_read) Dict[]
    -parse_email(connection, email_id) Dict~str,str~
    -extract_body(msg) str
    +extract_forwarded_date(text) str
    +extract_forwarded_sender(text) str
    +mark_as_read(email_id) bool
    +mark_as_unread(email_id) bool
  }
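
Header decoding and multipart body extraction can be illustrated with the standard `email` package. The functions below are a sketch under assumptions: they share the `extract_body` name with the class diagram but are not the client's implementation, and preferring the first non-attachment `text/plain` part is a choice made here for illustration.

```python
import email
from email.header import decode_header, make_header
from email.message import Message


def decode_subject(raw: str) -> str:
    """Decode an RFC 2047 encoded header value into a plain string."""
    return str(make_header(decode_header(raw)))


def extract_body(msg: Message) -> str:
    """Return the first text/plain part that is not an attachment, or ''."""
    parts = msg.walk() if msg.is_multipart() else [msg]
    for part in parts:
        if part.get_content_type() != "text/plain":
            continue  # skips multipart containers and HTML alternatives
        if "attachment" in str(part.get("Content-Disposition", "")).lower():
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        return payload.decode(charset, errors="replace")
    return ""


def parse_raw_email(raw: str) -> dict:
    """Parse a raw RFC 5322 message into the fields the pipeline consumes."""
    msg = email.message_from_string(raw)
    return {
        "subject": decode_subject(msg.get("Subject", "")),
        "sender": msg.get("From", ""),
        "body": extract_body(msg),
    }
```

Decoding with `errors="replace"` means a mislabeled charset degrades to replacement characters instead of aborting the whole email.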

PlacementPolicyService: Policy Extraction and Storage

  • Detects placement policy emails and converts them into structured Markdown with TOC.

  • Generates slugs and validates years from content.

  • Upserts policy documents into MongoDB.

flowchart TD
  PStart["Receive policy email"] --> Detect["Detect is_policy_update"]
  Detect --> Extract["Extract policy JSON via LLM"]
  Extract --> Upsert["Upsert PolicyDocument in DB"]
  Upsert --> PEnd["Done"]
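
Slug generation, TOC building, and year validation are simple text transforms, sketched below. The helper names, the slug rules, the three-level heading limit, and the accepted year range are assumptions for illustration; the service's real rules are not shown in this document.

```python
import re
from typing import Optional


def slugify(title: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics into hyphens, trim the ends."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_toc(markdown: str) -> str:
    """Build a nested Markdown TOC from ATX headings of depth 1 to 3."""
    lines = []
    for match in re.finditer(r"^(#{1,3})[ \t]+(.+?)[ \t]*$", markdown, re.MULTILINE):
        depth = len(match.group(1)) - 1
        heading = match.group(2)
        lines.append(f"{'  ' * depth}- [{heading}](#{slugify(heading)})")
    return "\n".join(lines)


def extract_year(text: str, lo: int = 2000, hi: int = 2100) -> Optional[int]:
    """Return the first four-digit number within [lo, hi], or None."""
    for token in re.findall(r"\b(\d{4})\b", text):
        if lo <= int(token) <= hi:
            return int(token)
    return None
```

Bounding the year range guards against picking up unrelated four-digit numbers, such as reference codes, from the email body.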

DatabaseService: Persistence and Deduplication

  • Saves notices and jobs, deduplicates by ID, and tracks sent status.

  • Provides stats and retrieval helpers for unsent notices and collections.

classDiagram
  class DatabaseService {
    +notice_exists(notice_id) bool
    +get_all_notice_ids() set
    +save_notice(notice) (bool, str)
    +get_notice_by_id(id) Dict
    +get_unsent_notices() Dict[]
    +mark_as_sent(post_id) bool
    +get_notice_stats() Dict
    +structured_job_exists(id) bool
    +get_all_job_ids() set
    +upsert_structured_job(job) (bool, str)
    +get_all_jobs(limit) Dict[]
    +get_all_policies(limit) Dict[]
    +upsert_policy(policy) (bool, str)
  }
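
Deduplication by ID and sent-status tracking can be demonstrated with an in-memory stand-in. `InMemoryNoticeStore` and `make_notice_id` are hypothetical: the real service is backed by MongoDB, and the meaning of the `(bool, str)` return value and the way notice IDs are derived are assumptions made for this sketch.

```python
import hashlib


def make_notice_id(subject: str, time_sent: str) -> str:
    """Derive a stable ID so the same email always maps to the same notice (assumed scheme)."""
    key = f"{subject.strip().lower()}|{time_sent}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


class InMemoryNoticeStore:
    """Dict-backed stand-in that mirrors a few DatabaseService method names."""

    def __init__(self) -> None:
        self._notices: dict[str, dict] = {}

    def notice_exists(self, notice_id: str) -> bool:
        return notice_id in self._notices

    def save_notice(self, notice: dict) -> tuple[bool, str]:
        """Return (True, id) on insert, (False, reason) on a missing or duplicate ID."""
        notice_id = notice.get("id")
        if not notice_id:
            return False, "missing id"
        if self.notice_exists(notice_id):
            return False, f"duplicate: {notice_id}"
        self._notices[notice_id] = {**notice, "sent_to_telegram": False}
        return True, notice_id

    def get_unsent_notices(self) -> list[dict]:
        return [n for n in self._notices.values() if not n["sent_to_telegram"]]

    def mark_as_sent(self, notice_id: str) -> bool:
        if notice_id not in self._notices:
            return False
        self._notices[notice_id]["sent_to_telegram"] = True
        return True
```

Deduplication is only as good as the ID: if two runs derive different IDs for the same email, both copies are stored and both are sent.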

NotificationService: Channel Routing and Delivery

  • Aggregates channels (Telegram, Web Push) and broadcasts each unsent notice to their users.

  • Supports targeted channels and detailed per-channel results.

sequenceDiagram
  participant Runner as "NotificationRunner"
  participant NS as "NotificationService"
  participant DB as "DatabaseService"
  participant TG as "TelegramService"
  participant WP as "WebPushService"
  Runner->>NS : send_unsent_notices(telegram?, web?)
  NS->>DB : get_unsent_notices()
  loop For each notice
    NS->>TG : broadcast_to_all_users(formatted_message)
    NS->>WP : broadcast_to_all_users(formatted_message)
    NS->>DB : mark_as_sent(_id)
  end
  NS-->>Runner : Results summary
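
Channel routing with per-channel results might be structured as below. The `Channel` protocol, `RecordingChannel` test double, and `send_unsent` function are hypothetical, and marking a notice as sent when at least one channel succeeds is an assumed policy; the document does not state the actual condition.

```python
from typing import Callable, Optional, Protocol


class Channel(Protocol):
    """Anything that can broadcast a formatted message to its users."""
    name: str

    def broadcast_to_all_users(self, message: str) -> bool: ...


class RecordingChannel:
    """Test double that records messages instead of sending them."""

    def __init__(self, name: str, ok: bool = True) -> None:
        self.name = name
        self.ok = ok
        self.sent: list[str] = []

    def broadcast_to_all_users(self, message: str) -> bool:
        self.sent.append(message)
        return self.ok


def send_unsent(
    notices: list[dict],
    channels: list[Channel],
    mark_as_sent: Callable[[str], object],
    targets: Optional[set[str]] = None,
) -> dict[str, dict[str, bool]]:
    """Broadcast each notice to the targeted channels and collect per-channel results."""
    results: dict[str, dict[str, bool]] = {}
    for notice in notices:
        per_channel = {}
        for channel in channels:
            if targets is not None and channel.name not in targets:
                continue  # channel not requested for this run
            per_channel[channel.name] = channel.broadcast_to_all_users(
                notice["formatted_message"]
            )
        results[notice["id"]] = per_channel
        # Assumed policy: one successful channel is enough to mark the notice sent.
        if any(per_channel.values()):
            mark_as_sent(notice["id"])
    return results
```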

Webhook Server: External Integrations

  • Exposes health, stats, push subscription, and notification endpoints.

  • Enables external systems to trigger update jobs and send notifications.

sequenceDiagram
  participant Ext as "External System"
  participant WH as "WebhookServer"
  participant NS as "NotificationService"
  participant DB as "DatabaseService"
  Ext->>WH : POST /webhook/update
  WH->>NS : send_unsent_notices(telegram=true, web=true)
  NS->>DB : get_unsent_notices()
  DB-->>NS : List of notices
  NS-->>WH : Results
  WH-->>Ext : {success, result}

Dependency Analysis

  • EmailNoticeService depends on GoogleGroupsClient for fetching, LLM for classification/extraction, and DatabaseService for persistence.

  • NoticeFormatterService depends on Notice and Job models and uses LLM for classification and extraction.

  • PlacementPolicyService depends on DatabaseService for upserting policy documents.

  • NotificationService depends on channel implementations and DatabaseService for unsent notices.

  • Webhook server composes NotificationService and WebPushService for integrations.

graph LR
  ENS["EmailNoticeService"] --> GGC["GoogleGroupsClient"]
  ENS --> LLM["ChatGoogleGenerativeAI"]
  ENS --> DB["DatabaseService"]
  NFS["NoticeFormatterService"] --> DB
  PPS["PlacementPolicyService"] --> DB
  NS["NotificationService"] --> DB
  WH["WebhookServer"] --> NS
  WH --> WP["WebPushService"]

Performance Considerations

  • Batch processing: The CLI orchestrator fetches unread IDs and processes emails sequentially, marking as read upon success to prevent reprocessing.

  • Retry strategy: Up to two retries for extraction failures to improve robustness.

  • Lazy enrichment: Jobs are enriched only when matched by the LLM, minimizing expensive API calls.

  • Connection handling: GoogleGroupsClient connects per operation and disconnects afterwards, trading connection reuse for protection against stale connections.

  • Logging and daemon mode: Centralized logging and daemon mode reduce overhead in production.

Troubleshooting Guide

Common issues and resolutions:

  • Authentication failures (IMAP): Verify placement email and app password environment variables.

  • LLM extraction errors: Inspect the validation errors and retry attempts; confirm that the LLM response matches the JSON structure the prompt requests.

  • Database connectivity: Confirm MongoDB connection string and collection initialization.

  • Webhook endpoints: Check VAPID keys and CORS configuration for push endpoints.

  • Duplicate notices: DatabaseService deduplicates by ID; ensure consistent notice IDs.

Conclusion

The email processing and classification system uses LLM prompts to extract and structure notices from emails fetched through Google Groups. It integrates with the IMAP-based email client, enforces deduplication by ID, and delivers notifications to Telegram and web push through modular components.

Appendices

Data Model: ExtractedNotice and NoticeDocument

  • ExtractedNotice: Fields include is_notice, rejection_reason, title, content, type, source, deadline, links, additional_info, and type-specific fields (students, company_name, role, package, location, eligibility_criteria, hiring_flow, job_type, event_name, topic, theme, speaker, date, time, registration_link, start_date, end_date, registration_deadline, prize_pool, team_size, organizer).

  • NoticeDocument: Stored representation with id, title, content, author, type, source, formatted_message, createdAt, updatedAt, sent_to_telegram, time_sent, deadline, links, students, students_count.
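
The two structures above can be pictured as dataclasses. This sketch covers only a subset of the listed fields (it omits `formatted_message`, `time_sent`, and most type-specific fields), the field types are assumptions, and `to_document` is a hypothetical mapping helper rather than the project's code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExtractedNotice:
    """Subset of the LLM output fields; types are assumed."""
    is_notice: bool
    rejection_reason: Optional[str] = None
    title: str = ""
    content: str = ""
    type: str = ""
    source: str = ""
    deadline: Optional[str] = None
    links: list[str] = field(default_factory=list)
    students: list[dict] = field(default_factory=list)


@dataclass
class NoticeDocument:
    """Subset of the stored fields; names follow the list above."""
    id: str
    title: str
    content: str
    author: str
    type: str
    source: str
    createdAt: str
    updatedAt: str
    sent_to_telegram: bool = False
    deadline: Optional[str] = None
    links: list[str] = field(default_factory=list)
    students: list[dict] = field(default_factory=list)
    students_count: int = 0


def to_document(notice: ExtractedNotice, notice_id: str, author: str) -> NoticeDocument:
    """Map an accepted extraction to its stored form; rejected ones raise ValueError."""
    if not notice.is_notice:
        raise ValueError(notice.rejection_reason or "not a notice")
    now = datetime.now(timezone.utc).isoformat()
    return NoticeDocument(
        id=notice_id, title=notice.title, content=notice.content, author=author,
        type=notice.type, source=notice.source, createdAt=now, updatedAt=now,
        deadline=notice.deadline, links=list(notice.links),
        students=list(notice.students), students_count=len(notice.students),
    )
```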

Classification Criteria and Notice Types

  • Notice types supported: announcement, hackathon, job_posting, shortlisting, update, webinar, reminder, internship_noc.

  • Classification is LLM-based; irrelevant/spam/placement-offer emails are rejected early.

  • Category classification for formatting: update, shortlisting, announcement, hackathon, webinar, job posting.
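
Because classification is delegated to the LLM, the reply has to be parsed defensively before its type is trusted. The sketch below assumes the reply is a JSON object carrying the `is_notice`, `rejection_reason`, and `type` fields from the data model, possibly wrapped in a Markdown fence; `parse_llm_reply` is a hypothetical helper, and the fence handling is an assumption about model behavior rather than something the source describes.

```python
import json
from typing import Optional

# Notice types listed above.
NOTICE_TYPES = {
    "announcement", "hackathon", "job_posting", "shortlisting",
    "update", "webinar", "reminder", "internship_noc",
}

FENCE = "`" * 3  # Markdown code fence marker


def parse_llm_reply(reply: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (notice, None) on success, or (None, reason) when the reply is unusable."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop an optional fence (with or without a "json" tag) around the payload.
        text = text[len(FENCE):]
        if text.lower().startswith("json"):
            text = text[4:]
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"malformed JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "reply is not a JSON object"
    if not data.get("is_notice", False):
        return None, data.get("rejection_reason") or "rejected by classifier"
    if data.get("type") not in NOTICE_TYPES:
        return None, f"unknown type: {data.get('type')!r}"
    return data, None
```

Returning a reason string alongside `None` lets the caller log why an email was skipped instead of silently dropping it.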

Priority Assignment and Filtering

  • Priority is implicit: placement offers are handled by a separate pipeline; general notices are processed after placement offers in the orchestrated email update flow.

  • Filtering: LLM determines relevance; validation ensures minimal quality; deduplication prevents repeated notifications.

Integration with Email Clients and Authentication

  • GoogleGroupsClient authenticates via IMAP and app password; supports fetching unread emails, extracting forwarded metadata, and marking read/unread.

  • Configuration: Environment variables for placement email and app password.
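
IMAP authentication with an app password can be sketched with the standard `imaplib` module. The environment variable names (`PLACEMENT_EMAIL`, `PLACEMENT_APP_PASSWORD`, `IMAP_HOST`), the default host `imap.gmail.com`, and both function names are assumptions for illustration; the document only states that the placement email and app password come from the environment.

```python
import imaplib
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ImapConfig:
    host: str
    user: str
    app_password: str


def load_imap_config(env: Mapping[str, str] = os.environ) -> ImapConfig:
    """Read credentials from the environment; the variable names are hypothetical."""
    user = env.get("PLACEMENT_EMAIL")
    password = env.get("PLACEMENT_APP_PASSWORD")
    if not user or not password:
        raise RuntimeError("PLACEMENT_EMAIL and PLACEMENT_APP_PASSWORD must be set")
    return ImapConfig(
        host=env.get("IMAP_HOST", "imap.gmail.com"),  # assumed default host
        user=user,
        app_password=password,
    )


def unread_message_ids(config: ImapConfig, folder: str = "INBOX") -> list[str]:
    """Connect, search for unseen messages, and always log out afterwards."""
    conn = imaplib.IMAP4_SSL(config.host)
    try:
        conn.login(config.user, config.app_password)
        conn.select(folder, readonly=True)  # read-only: listing must not change flags
        status, data = conn.search(None, "UNSEEN")
        if status != "OK" or not data or not data[0]:
            return []
        return data[0].decode().split()
    finally:
        conn.logout()
```

Failing fast in `load_imap_config` turns a missing variable into a clear startup error rather than an opaque login failure later.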

Batch Processing Workflows

  • CLI orchestrator: Iterates unread emails, tries placement offer detection first, then general notice extraction, persists results, and marks as read.

  • NotificationRunner: Sends unsent notices via Telegram and/or Web Push channels.
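
The CLI orchestrator's loop can be approximated as follows. `process_unread` and its parameters are hypothetical, and two behaviors are assumptions of this sketch: an email that no handler accepts stays unread, and an email whose fetch or handling raises is left unread for the next run.

```python
from typing import Callable, Iterable


def process_unread(
    email_ids: Iterable[str],
    fetch: Callable[[str], dict],
    handlers: list[Callable[[dict], bool]],
    mark_as_read: Callable[[str], object],
) -> dict[str, list[str]]:
    """Run each email through the handlers in order; mark it read once one succeeds."""
    summary: dict[str, list[str]] = {"processed": [], "skipped": [], "failed": []}
    for email_id in email_ids:
        try:
            email_data = fetch(email_id)
            # any() short-circuits, so later handlers only run when earlier ones decline.
            handled = any(handler(email_data) for handler in handlers)
        except Exception:
            summary["failed"].append(email_id)  # stays unread, so the next run retries it
            continue
        if handled:
            mark_as_read(email_id)
            summary["processed"].append(email_id)
        else:
            summary["skipped"].append(email_id)
    return summary
```

Passing the placement-offer detector before the general notice extractor in `handlers` reproduces the ordering described above.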

Spam Detection and Duplicate Filtering

  • Spam detection: LLM prompt explicitly rejects spam or irrelevant content.

  • Duplicate filtering: DatabaseService checks existence by ID before insertion.

Content Enrichment and Formatting

  • NoticeFormatterService enriches notices with job matching and formats Telegram-ready messages with IST timestamps and attribution.

  • Email date normalization: Forwarded dates and email Date headers are normalized to IST.
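
Date normalization to IST can be done with the standard library alone. The sketch below assumes dates arrive in RFC 5322 form (as in an email `Date` header) and that a forwarded block contains a `Date:` line in the same form; `to_ist`, `forwarded_date_ist`, and the regular expression are hypothetical, and other forwarded-date layouts would need extra parsing.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

IST = timezone(timedelta(hours=5, minutes=30), name="IST")

# Assumed layout of a forwarded block: a "Date: ..." line inside the body.
FORWARDED_DATE = re.compile(r"^Date:\s*(.+)$", re.MULTILINE)


def to_ist(date_header: str) -> datetime:
    """Parse an RFC 5322 date string and convert it to IST."""
    parsed = parsedate_to_datetime(date_header)
    if parsed.tzinfo is None:
        # No usable offset in the header: treat it as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(IST)


def forwarded_date_ist(body: str) -> Optional[datetime]:
    """Return the forwarded message's date in IST, or None if absent or unparseable."""
    match = FORWARDED_DATE.search(body)
    if not match:
        return None
    try:
        return to_ist(match.group(1).strip())
    except (TypeError, ValueError):
        return None
```

For example, a header of `Tue, 14 Jan 2025 05:00:00 +0000` converts to 10:30 on the same day in IST.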
